Abstract

Applying traditional collaborative filtering to digital publishing is challenging because user data is very sparse due to the high volume of documents relative to the number of users. Content based approaches, on the other hand, are attractive because textual content is often very informative. In this paper we describe large-scale content based collaborative filtering for digital publishing. To solve the digital publishing recommender problem we compare two approaches, latent Dirichlet allocation (LDA) and deep belief nets (DBN), that both find low-dimensional latent representations for documents. Efficient retrieval can be carried out in the latent representation. We work both on public benchmarks and on digital media content provided by Issuu, an online publishing platform. This article also comes with a newly developed deep belief nets toolbox for topic modeling tailored towards performance evaluation of the DBN model and comparisons to the LDA model.


1. Introduction 

This article concerns the comparison of deep belief nets (DBN) and latent Dirichlet allocation (LDA) for finding a low-dimensional latent representation of documents. DBN and LDA are both generative bag-of-words models and represent conceptual meanings of documents. Documents similar to a query document are retrieved from the low-dimensional output space through a distance measurement. A deep belief net toolbox (DBNT)^1 has been developed to implement the DBN and evaluate comparisons. The advantage of the DBN is its ability to perform a highly non-linear dimensionality reduction, due to its deep architecture (Hinton & Salakhutdinov, 2006). A very low-dimensional representation in output space results in fast retrieval of documents similar to a query document. The LDA model is a mixture model seeking to find the posterior distribution of its hidden variables given its visible variables (Blei et al., 2003). The number of topics K must be given for the LDA model, defining the dimensionality of the Dirichlet-distributed output space. The latent representation of a document is the probability of the document belonging to each topic, comprising a vector of size K. To run simulations on the LDA model, we have used the Gensim package for Python^2 (Řehůřek & Sojka, 2010). The work is conducted in collaboration with Issuu^3, a digital publishing platform delivering reading experiences of magazines, books, catalogs and newspapers.


2. Deep Belief Nets 

The DBN is a directed acyclic graph except for the top two layers, which form an undirected bipartite graph. The top two layers are what give the DBN the ability to unroll into a deep autoencoder (DA) and perform reconstructions of the input data (Bengio, 2009). The DBN consists of a visible layer, an output layer and a number of hidden layers. The training process of the DBN is defined by two steps: pre-training and fine-tuning.

^1 Refer to Github.com: Deep Belief Nets for Topic Modeling.

^2 http://radimrehurek.com/gensim/models/ldamodel.html

^3 http://issuu.com

In pre-training the layers of the DBN are separated pairwise to form restricted Boltzmann machines (RBM). Each RBM is trained independently, such that the output of the lower RBM is provided as input to the next higher-level RBM and so forth. This way the layers of the DBN are trained as partly independent systems. The goal of the pre-training process is to achieve approximations of the model parameters. A document is modeled by its word-count vector. To model the word-count vectors the bottom RBM is a replicated softmax model (RSM) (Salakhutdinov & Hinton, 2010). The hidden layers of the RBMs consist of stochastic binary units. Training is executed through Gibbs sampling using contrastive divergence as the approximation to the gradient (Hinton, 2002). The RBMs are trained with batch learning and the model only performs a single Gibbs step before updating the weights (Hinton, 2012). Given a visible input vector $\mathbf{v} = [v_1, \ldots, v_D]$, the probability of a hidden unit $j$ being on is given by

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(a_j + \sum_{i=1}^{D} v_i W_{ij}\Big), \qquad (1)$$

where $\sigma$ denotes the logistic sigmoid function, $a_j$ the bias of hidden unit $j$, $v_i$ the state of visible unit $i$, $W_{ij}$ the weight between visible unit $i$ and hidden unit $j$, and $D$ the number of visible units. Except for the RSM, the visible units are binary, where the probability is given by

$$p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(b_i + \sum_{j=1}^{M} h_j W_{ij}\Big), \qquad (2)$$


where $b_i$ denotes the bias of visible unit $i$ and $M$ the number of hidden units. The RSM assumes a multinomial distribution where the units of the visible layer are softmax units. Having a number of softmax units with identical weights is equivalent to having one multinomial unit sampled the same number of times (Salakhutdinov & Hinton, 2010). The probability of $v_i$ taking on value $n$ is

$$p(v_i = n \mid \mathbf{h}) = \frac{e^{\,b_i + \sum_{j=1}^{M} h_j W_{ij}}}{\sum_{q=1}^{D} e^{\,b_q + \sum_{j=1}^{M} h_j W_{qj}}}. \qquad (3)$$


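The RSM considers the number of words in each document by scaling the bias terms of the hidden units with the length of each document. The weights and biases of the RBM are updated by

$$\Delta W = \epsilon\big(E_{p_{data}}[\mathbf{v}\mathbf{h}^T] - E_{p_{recon}}[\mathbf{v}\mathbf{h}^T]\big), \qquad (4)$$

$$\Delta b = \epsilon\big(E_{p_{data}}[\mathbf{v}] - E_{p_{recon}}[\mathbf{v}]\big), \qquad (5)$$

$$\Delta a = \epsilon\big(E_{p_{data}}[\mathbf{h}] - E_{p_{recon}}[\mathbf{h}]\big), \qquad (6)$$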


where $\epsilon$ is the learning rate and the distribution denoted $p_{recon}$ defines the reconstruction of the input data $p_{data}$ and is the result of a Gibbs chain running a single Gibbs step. $E_{p_{data}}[\cdot]$ is the expectation with respect to the joint distribution of the real data $p_{data}(\mathbf{h}, \mathbf{v}) = p_{data}(\mathbf{h} \mid \mathbf{v})\,p_{data}(\mathbf{v})$. $E_{p_{recon}}[\cdot]$ denotes the expectation with respect to the reconstructions. To optimize the training we add weight decay and momentum to the parameter update (Hinton & Salakhutdinov, 2010). The model parameters from pre-training are passed on to the fine-tuning. The network is transformed into a DA by replicating and mirroring the input and hidden layers and attaching them to the output of the DBN. Backpropagation on unlabeled data can be performed on the DA by computing a probability of the input data $p(x)$ instead of computing the probability of a label $t$ given the input data, $p(t \mid x)$. This way it is possible to generate an error estimate by comparing the normalized input data to the output probability. The stochastic binary units of the pre-training are replaced by sigmoid units with deterministic, real-valued probabilities. Since the input data is under a multinomial distribution, cross-entropy is applied as the error function. The conjugate gradient optimization framework is used to update the model parameters until convergence. The DBN can output binary and real output values (Salakhutdinov & Hinton, 2009). The binary output values are computed by adding deterministic Gaussian noise to the input of the output layer during fine-tuning. This way the output of the logistic sigmoid function at the output units will be close to 0 or 1 (Salakhutdinov & Hinton, 2009). The output values of the trained DBN are compared to a threshold^4 in order to decide the binary value. Distance computations on binary output vectors are much faster (Hinton & Salakhutdinov, 2010).
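The thresholding of output values and the retrieval over binary codes described above can be sketched as follows. This is a minimal illustration and not the DBNT implementation: the threshold of 0.1 and the use of Hamming distance come from the footnotes of this article, while the function names, the strict `>` comparison and the packing of each code into a Python integer are assumptions made for the example.

```python
from typing import List, Sequence

THRESHOLD = 0.1  # threshold value quoted in the article's footnote


def binarize(outputs: Sequence[float], threshold: float = THRESHOLD) -> int:
    """Pack real-valued output activations into an integer bit code (assumed encoding)."""
    code = 0
    for value in outputs:
        # Shift the code left and append 1 if the unit exceeds the threshold, else 0.
        code = (code << 1) | (1 if value > threshold else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


def nearest(query: int, codes: List[int], k: int) -> List[int]:
    """Indices of the k codes closest to the query in Hamming distance."""
    return sorted(range(len(codes)), key=lambda i: hamming(query, codes[i]))[:k]
```

Because the distance reduces to an XOR followed by a bit count, comparing two 128-bit codes costs a handful of machine operations, which is one way to read the speed advantage of binary output vectors.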


3. Simulations 


We have performed model evaluations on the 20 Newsgroups dataset^5. A dataset based on the Wikipedia Corpus is used to compare the DBN to the LDA model, since it contains labeled data. The Issuu Corpus has no labeled test set, so we compare the DBN to labels defined by a human perception of the topic distributions of Issuu's LDA model. The models are evaluated by retrieving a number of documents similar to a query document in the test set and averaging over all possible queries. This provides the fraction of documents in the test set having similar documents in their proximity in the output space^6. The numbers of neighbors evaluated are 1, 3, 7, 15, 31, and 63. This evaluation is denoted the accuracy measurement.
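One possible reading of the accuracy measurement is sketched below. The text does not spell out the exact formula, so the definition used here (the fraction of the k nearest neighbours that share the query's label, averaged over all queries) is an assumption, as are the function names. Euclidean distance is used, as stated for real-valued output vectors.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two output vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def accuracy_at_k(vectors: List[Sequence[float]], labels: List[int], k: int) -> float:
    """Average fraction of each query's k nearest neighbours sharing its label (assumed definition)."""
    total = 0.0
    for q, (query_vec, query_label) in enumerate(zip(vectors, labels)):
        # Every other document in the test set is a retrieval candidate.
        others = [i for i in range(len(vectors)) if i != q]
        others.sort(key=lambda i: euclidean(query_vec, vectors[i]))
        hits = sum(1 for i in others[:k] if labels[i] == query_label)
        total += hits / k
    return total / len(vectors)


# Toy usage with two well-separated classes; the evaluation points follow the text.
docs = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
doc_labels = [0, 0, 1, 1]
scores = {k: accuracy_at_k(docs, doc_labels, k) for k in (1, 3)}
```

In the simulations the same loop would run over the test-set output vectors for k in 1, 3, 7, 15, 31 and 63.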


^4 A threshold of 0.1 is applied due to a high number of output values lying closer to 0 than 1 (Hinton & Salakhutdinov, 2010).

^5 Refer to (Hinton & Salakhutdinov, 2010) for details.

^6 Euclidean distance and Hamming distance are applied as distance metrics on real-valued and binary output vectors, respectively.
The learning rate is set to $\epsilon = 0.01$, the momentum to $m = 0.9$ and the weight decay to $\lambda = 0.0002$. The weights are initialized from a zero-mean normal distribution with variance 0.01. The biases are initialized to 0 and the number of epochs is set to 50. Pre-training uses batch learning, where each batch represents 100 documents. For fine-tuning, larger batches of 1000 documents are generated. We perform three line searches for the conjugate gradient algorithm and the number of epochs is set to 50. The Gaussian noise for the binary output DBN is defined as deterministic noise with mean 0 and variance 16 (Hinton & Salakhutdinov, 2010).
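To make the pre-training settings concrete, the following is a minimal sketch of a single contrastive divergence step (one Gibbs step) for an RBM with binary visible units, using the learning rate, momentum and weight decay quoted above. It is an illustration under stated assumptions, not the DBNT code: it does not cover the replicated softmax bottom layer, it applies momentum and weight decay to the weights only, and all function and variable names are hypothetical.

```python
import math
import random
from typing import List

EPSILON, MOMENTUM, WEIGHT_DECAY = 0.01, 0.9, 0.0002  # values quoted in the text


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v: List[float], W: List[List[float]], a: List[float]) -> List[float]:
    """Eq. (1): p(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(a[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(a))]


def visible_probs(h: List[float], W: List[List[float]], b: List[float]) -> List[float]:
    """Eq. (2): p(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(b[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(b))]


def cd1_step(batch, W, a, b, vel_W, rng: random.Random) -> None:
    """One CD-1 update on a batch of binary visible vectors; parameters change in place."""
    D, M, n = len(b), len(a), len(batch)
    grad_W = [[0.0] * M for _ in range(D)]
    grad_a, grad_b = [0.0] * M, [0.0] * D
    for v0 in batch:
        p_h0 = hidden_probs(v0, W, a)
        h0 = [1.0 if rng.random() < p else 0.0 for p in p_h0]  # sample hidden states
        v1 = visible_probs(h0, W, b)   # reconstruction after a single Gibbs step
        p_h1 = hidden_probs(v1, W, a)
        for i in range(D):
            grad_b[i] += (v0[i] - v1[i]) / n                            # Eq. (5)
            for j in range(M):
                grad_W[i][j] += (v0[i] * p_h0[j] - v1[i] * p_h1[j]) / n  # Eq. (4)
        for j in range(M):
            grad_a[j] += (p_h0[j] - p_h1[j]) / n                        # Eq. (6)
    for i in range(D):
        b[i] += EPSILON * grad_b[i]
        for j in range(M):
            # Momentum keeps a running velocity; weight decay shrinks the weights.
            vel_W[i][j] = (MOMENTUM * vel_W[i][j]
                           + EPSILON * (grad_W[i][j] - WEIGHT_DECAY * W[i][j]))
            W[i][j] += vel_W[i][j]
    for j in range(M):
        a[j] += EPSILON * grad_a[j]
```

A variance of 0.01 corresponds to a standard deviation of 0.1, so a weight matrix matching the initialization above could be drawn with `rng.gauss(0.0, 0.1)` for each entry before calling `cd1_step` once per batch of 100 documents.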


3.1. Model Evaluation 


Fig. 1 shows that the DBNT performs comparably to the model by Hinton and Salakhutdinov (Hinton & Salakhutdinov, 2010). When comparing the real-valued output DBN with a binary output DBN, we have observed that the accuracy measurements are very similar for a higher-dimensional output vector (cf. Fig. 2). For the following simulations we have only considered real-valued output vectors, though. Fig. 3 shows that the DBN manages to find an internal representation of the documents that is better than the high-dimensional input vectors.


[Figure 1 plot: curves DBNT 2000-500-500-128 and (Hinton & Salakhutdinov, 2010) 2000-500-500-128; x-axis: Neighbors.]

Figure 1. Accuracy measurements of the 2000-500-500-128-DBN with binary output units from (Hinton & Salakhutdinov, 2010) and a 2000-500-500-128-DBN with binary output units from the DBNT. The models are trained on the 20 Newsgroups dataset. NB: The results from (Hinton & Salakhutdinov, 2010) are read directly off the graph.

[Figure 2 plot: curves 2000-500-500-128 bin and 2000-500-500-128 real.]

Figure 2. Accuracy measurements of two 2000-500-500-128-DBNs with binary output units and real-valued output units trained on the 20 Newsgroups dataset.

[Figure 3 plot: curves 2000-500-500-128 bin and 2000-dimensional input; x-axis: Neighbors.]

Figure 3. Accuracy measurements of the 2000-500-500-128-DBN with binary output units and the 2000-dimensional input vectors.


3.2. Wikipedia Corpus

We have generated a dataset based on the Wikipedia Corpus. The dataset is denoted Wikipedia Business and contains articles from 12 subcategories of the Business category. We use categories with a large pool of articles and a strong connectivity to the remaining categories of the Wikipedia Corpus. The categories are administration, commerce, companies, finance, globalization, industry, labor, management, marketing, occupations, sales and sports business. The Wikipedia Business dataset consists of 32843 documents split into 22987 (70%) training set documents and 9856 (30%) test set documents. Wikipedia Business provides an indication of how well the DBN and LDA models capture the granularity of the data within subcategories of the Wikipedia Corpus. In order to compare the DBN model to the LDA model, we have computed accuracy measurements on a 2000-500-250-125-10-DBN with real-valued linear output units and accuracy measurements on two LDA models, one with K = 12 topics and another with K = 150 topics. The 2000-500-250-125-10-DBN outperforms the two LDA models in accuracy (cf. Fig. 4). The LDA model with K = 12 topics performs much worse than the DBN. The LDA model with K = 150 topics performs well when evaluating 1 neighbor, but deteriorates quickly throughout the evaluation points. The DBN is the superior model for dimensionality reduction on the Wikipedia Business dataset. Its accuracy measurements are higher and its output is 10-dimensional compared to the 150-dimensional topic distribution of the LDA model with the lowest error.

[Figure 4 plot: legend includes LDA 12-dimensional topic distribution.]

Figure 4. Accuracy measurements of two LDA models and a 2000-500-250-125-10 DBN.

We have computed accuracy measurements for: 2000-500-250-125-2-DBN, 2000-500-250-125-10-DBN, 2000-500-250-125-50-DBN and 2000-500-250-125-100-DBN (cf. Fig. 5). It is evident that the DBN with a 2-dimensional output scores a much lower accuracy measurement, due to its inability to differentiate between the documents. When increasing the number of output units by modeling the 2000-500-250-125-50-DBN and the 2000-500-250-125-100-DBN, we see that they outperform the original 2000-500-250-125-10-DBN. Even though one of these DBNs has an output vector twice the size of the other, the two evaluations are almost identical, which indicates saturation. Fig. 6 shows how the DBN spreads the data in output space. Since PCA has its limitations it is not possible to visualize more categories unless an approach such as t-SNE is applied (van der Maaten & Hinton, 2008).

[Figure 5 plot: legend includes 2000-500-250-125-2 and 2000-500-250-125-10; x-axis: Neighbors (1, 3, 7, 15, 31, 63).]

Figure 5. Accuracy measurements on DBNs with different numbers of output units.



Figure 6. PCA on the 1st and 2nd principal components on the 
test dataset input vectors and output vectors from a 2000-500-250- 
125-10-DBN. Left: PCA on the 2000-dimensional input. Right: 
PCA on the 10-dimensional output. 


3.3. Issuu Corpus 

To test the DBN on the Issuu dataset we have extracted a dataset across 5 categories defined from Issuu's LDA model. The documents in the dataset belong to the categories Business, Cars, Food & Cooking, Individual & Team Sports and Travel. The training set contains 13650 documents and the test set contains 5850 documents. As mentioned, Issuu has applied labels to the dataset from the results of their LDA model with a 150-dimensional latent representation. In order to compare the models, we have performed accuracy measurements for the 2000-500-250-125-10-DBN on these labels (cf. Fig. 7). The accuracy measurements show that the results of the two models are very similar. The main difference is that the DBN generates a 10-dimensional latent representation as opposed to the 150-dimensional latent representation of the LDA model.

[Figure 7 plot: curve 2000-500-250-125-10.]


Figure 7. Accuracy measurements of a 2000-500-250-125-10- 
DBN on the labels defined on the basis of Issuu’s LDA model. 


When plotting the test dataset input vectors and the output vectors of the 2000-500-250-125-10-DBN on the 1st and 2nd principal components, it is evident how the input data is cluttered and how the DBN manages to spread the documents in output space according to their labels (cf. Fig. 8). An analysis of the output space shows that categories such as Business and Cars are in close proximity to each other and far from a category like Food & Cooking.



Figure 8. PCA on the 1st and 2nd principal components on the 
test dataset input vectors and output vectors from a 2000-500-250- 
125-10-DBN. Left: PCA on the 2000-dimensional input. Right: 
PCA on the 10-dimensional output. 


Exploratory data analysis on the Issuu Corpus shows how the 2000-500-250-125-10-DBN maps documents into output space^7. We have chosen random query documents from different categories and retrieved the 10 documents in nearest proximity. When we query a car publication about an SUV, the 10 documents retrieved from output space are about cars. They are all publications promoting a new car, published by the car manufacturer. 7 out of the 10 related publications concern the same type of car. When comparing a query in output space with the same query in the high-dimensional input space, we see that, from a human perception, the retrieved documents are more accurate in output space.

^7 Due to copyright issues and the terms of service/privacy policy at Issuu, the results are not visualized in this article.
4. Conclusion

On the Wikipedia and Issuu corpora we have shown how the DBN is superior to the LDA models considered. The DBN manages to find a better internal representation of the documents in an output space of lower dimensionality. The low dimensionality of the output space results in fast retrieval of similar documents. A binary output vector of a larger dimensionality performs almost as well as a real-valued output vector of equivalent dimensionality. Finding similar documents from binary latent representations is even faster.